Feature Transform

Load Audio

openspeech.data.audio.load.load_audio(audio_path: str, sample_rate: int, del_silence: bool = False) → numpy.ndarray

Load an audio file (PCM) as a signal. If del_silence is True, eliminate all sounds below 30 dB. If an exception occurs in numpy.memmap(), return None.
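As a rough illustration of the behavior described above, the following standard-library sketch reads headerless 16-bit PCM, returns None when the file cannot be read, and optionally drops quiet frames. It is not the library's implementation: the names load_pcm_sketch and drop_silence_sketch, the 320-sample frame size, and the reading of "below 30 dB" as "more than 30 dB below the loudest frame" are all assumptions made for the example.

```python
import array
import math


def drop_silence_sketch(samples, top_db=30.0, frame=320):
    """Keep only frames whose RMS level is within top_db of the loudest frame."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    rms = [math.sqrt(sum(x * x for x in fr) / len(fr)) for fr in frames]
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return list(samples)  # empty or all-zero input: nothing to compare against
    kept = []
    for fr, level in zip(frames, rms):
        if level > 0.0 and 20.0 * math.log10(level / peak) > -top_db:
            kept.extend(fr)
    return kept


def load_pcm_sketch(audio_path, del_silence=False):
    """Hypothetical stand-in for load_audio: raw 16-bit PCM to floats near [-1, 1]."""
    try:
        with open(audio_path, "rb") as f:
            data = f.read()
    except OSError:
        return None  # mirror the documented "return None on failure" behavior
    pcm = array.array("h")  # signed 16-bit, native byte order (assumed little-endian)
    pcm.frombytes(data[: len(data) // 2 * 2])  # ignore a trailing odd byte
    samples = [s / 32767.0 for s in pcm]
    return drop_silence_sketch(samples) if del_silence else samples
```

Callers should check for None before using the result, since a failed read does not raise.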

Spectrogram Feature Transform

class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create a spectrogram from an audio signal.

Parameters

configs (DictConfig) – configuration set

Returns

A spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor

Spectrogram Feature Transform Configuration

class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a SpectrogramFeatureTransform.

It is used to instantiate a SpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters
  • name (str) – name of the feature transform (default: spectrogram)

  • sample_rate (int) – sampling rate of the audio, in Hz (default: 16000)

  • frame_length (float) – frame length for the spectrogram, in milliseconds (default: 20.0)

  • frame_shift (float) – hop length between successive STFT frames, in milliseconds (default: 10.0)

  • del_silence (bool) – flag indicating whether to delete silence (default: False)

  • num_mels (int) – number of features in the output, i.e. the second dimension of the returned tensor (default: 161)

  • apply_spec_augment (bool) – flag indicating whether to apply SpecAugment (default: True)

  • apply_noise_augment (bool) – flag indicating whether to apply noise augmentation (default: False)

  • apply_time_stretch_augment (bool) – flag indicating whether to apply time-stretch augmentation (default: False)

  • apply_joining_augment (bool) – flag indicating whether to apply audio-joining augmentation (default: False)
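The default num_mels of 161 is consistent with a 20 ms window at 16 kHz. The arithmetic below is a minimal sketch, assuming that frame_length and frame_shift are given in milliseconds and that the FFT size equals the window length in samples; stft_dims is a hypothetical helper, not part of the library.

```python
def stft_dims(sample_rate=16000, frame_length_ms=20.0, frame_shift_ms=10.0):
    """Convert millisecond frame settings into sample counts and frequency bins."""
    n_fft = int(round(sample_rate * 0.001 * frame_length_ms))      # window size in samples
    hop_length = int(round(sample_rate * 0.001 * frame_shift_ms))  # hop size in samples
    num_bins = n_fft // 2 + 1  # one-sided spectrum of a real-valued signal
    return n_fft, hop_length, num_bins


print(stft_dims())  # (320, 160, 161)
```

Under these assumptions, changing sample_rate or frame_length without adjusting num_mels would leave the configured feature size out of step with the actual number of frequency bins.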

Mel-Spectrogram Feature Transform

class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters

configs (DictConfig) – configuration set

Returns

A mel-spectrogram feature. The shape is (seq_length, num_mels)

Return type

Tensor
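To make the "MelScale" half of that composition concrete, the sketch below computes where the centers of num_mels filters fall when they are spaced evenly on the mel scale. The HTK-style formula, the 0 Hz lower edge, and the helper names are assumptions for illustration; the library may use a different mel convention.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel scale (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_mels=80, sample_rate=16000, f_min=0.0):
    """Center frequencies (Hz) of num_mels filters spaced evenly on the mel scale."""
    f_max = sample_rate / 2.0  # Nyquist frequency
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_mels + 1)  # num_mels centers plus two outer edges
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_mels)]


centers = mel_center_frequencies()
print(len(centers))  # 80
```

The centers are dense at low frequencies and sparse at high frequencies, which is why a mel-spectrogram needs far fewer bins than the linear spectrogram it is derived from.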

Mel-Spectrogram Feature Transform Configuration

class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a MelSpectrogramFeatureTransform.

It is used to instantiate a MelSpectrogramFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters
  • name (str) – name of the feature transform (default: melspectrogram)

  • sample_rate (int) – sampling rate of the audio, in Hz (default: 16000)

  • frame_length (float) – frame length for the spectrogram, in milliseconds (default: 20.0)

  • frame_shift (float) – hop length between successive STFT frames, in milliseconds (default: 10.0)

  • del_silence (bool) – flag indicating whether to delete silence (default: False)

  • num_mels (int) – number of mel filter banks (default: 80)

  • apply_spec_augment (bool) – flag indicating whether to apply SpecAugment (default: True)

  • apply_noise_augment (bool) – flag indicating whether to apply noise augmentation (default: False)

  • apply_time_stretch_augment (bool) – flag indicating whether to apply time-stretch augmentation (default: False)

  • apply_joining_augment (bool) – flag indicating whether to apply audio-joining augmentation (default: False)
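Because the configuration is a dataclass with defaults for every field, a typical pattern is to start from the defaults and override only what differs. The example below uses a hypothetical stand-in class, MelSpectrogramConfigsSketch, that mirrors the documented fields so it runs without the library installed; it says nothing about how the real class is registered or consumed.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class MelSpectrogramConfigsSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""
    name: str = "melspectrogram"
    sample_rate: int = 16000
    frame_length: float = 20.0
    frame_shift: float = 10.0
    del_silence: bool = False
    num_mels: int = 80
    apply_spec_augment: bool = True
    apply_noise_augment: bool = False
    apply_time_stretch_augment: bool = False
    apply_joining_augment: bool = False


defaults = MelSpectrogramConfigsSketch()
# Override only what differs, e.g. turn augmentation off for evaluation.
eval_configs = replace(defaults, apply_spec_augment=False, del_silence=True)
print(asdict(eval_configs)["num_mels"])  # 80
```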

Filter-Bank Feature Transform

class openspeech.data.audio.filter_bank.filter_bank.FilterBankFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create a fbank from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.

Parameters

configs (DictConfig) – hydra configuration set

Inputs:

signal (np.ndarray): signal from audio file.

Returns

A fbank identical to what Kaldi would output. The shape is (seq_length, num_mels)

Return type

Tensor
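The seq_length of the output can be estimated from the number of input samples using Kaldi's framing rules. The helper below is a sketch of those rules with the default frame settings from this page; kaldi_num_frames is a hypothetical name, and which snip_edges mode the transform actually uses is an assumption, not something stated here.

```python
def kaldi_num_frames(num_samples, sample_rate=16000, frame_length_ms=20.0,
                     frame_shift_ms=10.0, snip_edges=True):
    """Estimate the number of frames Kaldi-style framing yields for a signal."""
    window = int(round(sample_rate * 0.001 * frame_length_ms))  # samples per frame
    shift = int(round(sample_rate * 0.001 * frame_shift_ms))    # samples per hop
    if snip_edges:
        # Only frames that fit entirely inside the signal are emitted.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Without snipping, the frame count depends only on the hop size.
    return (num_samples + shift // 2) // shift


print(kaldi_num_frames(16000))                    # 99
print(kaldi_num_frames(16000, snip_edges=False))  # 100
```

With snip_edges enabled, a signal shorter than one window produces no frames at all, which is worth guarding against for very short utterances.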

Filter-Bank Feature Transform Configuration

class openspeech.data.audio.filter_bank.configuration.FilterBankConfigs(name: str = 'fbank', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of a FilterBankFeatureTransform.

It is used to instantiate a FilterBankFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Parameters
  • name (str) – name of the feature transform (default: fbank)

  • sample_rate (int) – sampling rate of the audio, in Hz (default: 16000)

  • frame_length (float) – frame length for the spectrogram, in milliseconds (default: 20.0)

  • frame_shift (float) – hop length between successive STFT frames, in milliseconds (default: 10.0)

  • del_silence (bool) – flag indicating whether to delete silence (default: False)

  • num_mels (int) – number of mel filter banks (default: 80)

  • apply_spec_augment (bool) – flag indicating whether to apply SpecAugment (default: True)

  • apply_noise_augment (bool) – flag indicating whether to apply noise augmentation (default: False)

  • apply_time_stretch_augment (bool) – flag indicating whether to apply time-stretch augmentation (default: False)

  • apply_joining_augment (bool) – flag indicating whether to apply audio-joining augmentation (default: False)

MFCC Feature Transform

class openspeech.data.audio.mfcc.mfcc.MFCCFeatureTransform(configs: omegaconf.dictconfig.DictConfig)

Create the Mel-frequency cepstrum coefficients from an audio signal.

By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters

configs (DictConfig) – configuration set

Returns

An MFCC feature. The shape is (seq_length, num_mels)

Return type

Tensor
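The dependence on the maximum value comes from the dB-scaling step, which clips everything more than a fixed range below the loudest value. The sketch below illustrates the effect on two power values; power_to_db_sketch is a hypothetical helper, and the top_db of 80 and amin of 1e-10 are commonly used defaults assumed here, not values confirmed by this page.

```python
import math


def power_to_db_sketch(power, top_db=80.0, amin=1e-10):
    """Convert power values to dB, then clip to top_db below the loudest value."""
    db = [10.0 * math.log10(max(p, amin)) for p in power]
    floor = max(db) - top_db  # the floor moves with the loudest value in the input
    return [max(d, floor) for d in db]


full_clip = [1.0, 1e-9]   # a loud frame followed by a very quiet one
snippet = full_clip[1:]   # the quiet frame processed on its own

print(power_to_db_sketch(full_clip)[1])  # clipped to the floor set by the loud frame
print(power_to_db_sketch(snippet)[0])    # about -90 dB, since nothing louder sets the floor
```

The same quiet frame therefore maps to different dB values, and hence different MFCCs, depending on what else is in the clip.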

MFCC Feature Transform Configuration

class openspeech.data.audio.mfcc.configuration.MFCCConfigs(name: str = 'mfcc', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 40, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)

This is the configuration class that stores the configuration of an MFCCFeatureTransform.

It is used to instantiate an MFCCFeatureTransform feature transform.

Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.

Parameters
  • name (str) – name of the feature transform (default: mfcc)

  • sample_rate (int) – sampling rate of the audio, in Hz (default: 16000)

  • frame_length (float) – frame length for the spectrogram, in milliseconds (default: 20.0)

  • frame_shift (float) – hop length between successive STFT frames, in milliseconds (default: 10.0)

  • del_silence (bool) – flag indicating whether to delete silence (default: False)

  • num_mels (int) – number of MFC coefficients to retain (default: 40)

  • apply_spec_augment (bool) – flag indicating whether to apply SpecAugment (default: True)

  • apply_noise_augment (bool) – flag indicating whether to apply noise augmentation (default: False)

  • apply_time_stretch_augment (bool) – flag indicating whether to apply time-stretch augmentation (default: False)

  • apply_joining_augment (bool) – flag indicating whether to apply audio-joining augmentation (default: False)